Abstract. The 2dF Galaxy Redshift Survey (2dFGRS) has already measured over 200,000 
redshifts of nearby (median redshift z ~ 0.1) galaxies. This is the single largest set of galaxy 
spectra ever collected. It allows us to subdivide the survey into subsets according to the galaxy 
intrinsic properties. We outline here a method (based on Principal Component Analysis) for 
spectrally classifying these galaxies in a robust and independent manner. In so doing we develop 
a continuous measure of spectral type, η, which reflects the actual distribution of spectra in the 
observed galaxy population. We demonstrate the usefulness of this classification by estimating 
luminosity functions and clustering per spectral type. 



1 Introduction 

The 2dF Galaxy Redshift Survey (2dFGRS) is an ambitious project with the aim of mapping the 
galaxy distribution of the local universe more accurately than ever before |||]. It aims to acquire 
a complete sample of ~250,000 galaxy spectra, down to an extinction corrected magnitude limit of 
b_J < 19.45. The survey is now approaching completion with over 200,000 galaxy redshifts having 
already been measured (as of October 2001), and is already starting to yield significant scientific 
results (e.g 0, |l|, @, §). 

Apart from the main science goals of quantifying the large-scale structure of the Universe, one of 
the most significant contributions of the 2dFGRS is to our understanding of the galaxy population 
itself. Having a data set of 250,000 galaxy spectra allows us to test the validity of galaxy formation 
and evolution scenarios with unprecedented accuracy. However, the sheer size of the data set presents 
us with its own unique problems. Clearly in order to make the data set more 'digestible' some form 
of data compression is necessary, whether that be in the form of equivalent width measurements, 
morphological types, colours or some other compression. These quantities can be compared with 
theoretical predictions and simulations, and hence can set constraints on scenarios for galaxy formation 
and biasing. The approach we outline here is the development of a spectral classification derived from 
the distribution of the galaxy spectra themselves, using a Principal Component Analysis (PCA). Using 
the PCA as the basis of our classification has distinct advantages over other methods. Most notably 
the data are allowed to 'speak for themselves' rather than having ad-hoc classifications imposed on 
them. In addition, unlike equivalent widths or colours the PCA uses all the information contained 
in any given spectrum. The advantage of this is not only the fact that the classification is more 
informative but also that it is more useful at lower signal-to-noise ratios where individual spectral 
channels become noisy. Other data-compression and classification methods have been considered 
recently in other studies (e.g. |jl^, p, 0]). 

In these proceedings we briefly outline the development of our spectral classification and demonstrate 
several applications to which it has been put. For more details see ||9[. 

2 Spectral Classification 

2.1 Principal Component Analysis 

Principal component analysis (PCA) is a well established statistical technique which has proved very 
useful in dealing with highly dimensional data sets. In the particular case of galaxy spectra we 
are typically presented with approximately 1000 spectral channels per galaxy. When used in 
applications, however, this is usually compressed down to just a few numbers, either by integrating 
over small line features (yielding equivalent widths) or over wide colour filters. The key advantage 
of using PCA in our data compression is that it allows us to make use of all the information contained 
in the spectrum in a statistically unbiased way, i.e. without the use of such ad hoc filters. 

Table 1: The relative importance (measured as variance) of the first 8 principal components. 

In order to perform the PCA on our galaxy spectra we first construct a representative volume 
limited sample of the galaxies. When we apply the PCA to this sample it constructs an orthogonal 
set of components (eigenspectra) which span the wavelength space occupied by the galaxy spectra. 
These components have been specifically chosen by the PCA in such a way that as much information 
(variance) is contained in the first eigenspectrum as possible, and that the amount of the remaining 
information in all the subsequent eigenspectra is likewise maximised. Therefore, if the information 
contained in the first n eigenspectra is found to be significantly greater than that in the remaining 
eigenspectra we can significantly compress the data set by swapping each galaxy spectrum (described 
by 1000 channels) with its projections onto just those n eigenspectra. 

In the case of galaxy spectra we find that approximately two thirds of the total variance (including 
the noise) in the spectra can be represented in terms of only the first two projections (pc1, pc2). So, 
at least to a first approximation, galaxy spectra can be thought of as a two dimensional sequence in 
terms of these two projections. 
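
The compression step described above can be sketched numerically as follows. This is only an illustration on synthetic data, not the survey pipeline: the function name `pca_eigenspectra`, the toy templates and the noise level are all assumptions made here, and the PCA is implemented with a singular value decomposition of the mean-subtracted spectra.

```python
import numpy as np

def pca_eigenspectra(spectra, n_keep=2):
    """PCA of an (n_galaxies, n_channels) array with one spectrum per row."""
    mean_spectrum = spectra.mean(axis=0)
    residuals = spectra - mean_spectrum
    # Rows of vt are orthonormal eigenspectra, ordered by decreasing variance.
    _, singular_values, vt = np.linalg.svd(residuals, full_matrices=False)
    variance_fraction = singular_values**2 / np.sum(singular_values**2)
    eigenspectra = vt[:n_keep]
    # Each galaxy is now described by n_keep numbers instead of n_channels.
    projections = residuals @ eigenspectra.T
    return mean_spectrum, eigenspectra, projections, variance_fraction

# Toy data (assumption): 200 fake "spectra" built from two hidden templates plus noise.
rng = np.random.default_rng(0)
wavelength = np.linspace(0.0, 1.0, 50)
templates = np.vstack([wavelength, np.sin(6.0 * wavelength)])
weights = rng.normal(size=(200, 2)) * np.array([3.0, 1.0])
toy_spectra = weights @ templates + 0.05 * rng.normal(size=(200, 50))

mean_spec, eigspec, pcs, var_frac = pca_eigenspectra(toy_spectra)
print("variance fraction in first two components:", var_frac[:2].sum())
```

The columns of `pcs` play the role of (pc1, pc2), and `var_frac` is the kind of quantity tabulated in Table 1. For real spectra the fraction carried by the first two components is lower than in this toy case because noise contributes to the total variance.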

In Fig. 1 we show these first two eigenspectra. It can be seen from this figure that whilst the first 
eigenspectrum contains both information from the continuum and lines, the second is dominated by 
the latter. Because of this it is possible to take two simple linear combinations which isolate either 
the continuum or the emission/absorption line features. In effect what we are doing when we utilise 
these linear combinations is rotating the axes defined by the PCA to make the interpretation of the 
components more straightforward. In so doing we can see that a parameterisation in terms of pc1 
and pc2 is essentially equivalent to a two dimensional sequence in colour (continuum slope) and the 
average emission/absorption line strength. 

2.2 η Parameterisation 

The 2dF instrument was designed to measure large numbers of redshifts in as short an observing 
time as possible. However, in order to optimise the number of redshifts that can be measured in 
a given period of time, compromises have had to be made with respect to the spectral quality of 
the observations. Therefore if one wishes to characterise the observed galaxy population in terms of 
their spectral properties care must be taken in order to ensure that these properties are robust to the 
instrumental uncertainties. 

The 2dF instrument makes use of up to 400 optical fibres with a diameter of 140 μm (corresponding 
to ~ 2.0" on the sky, depending on plate position). The quality and representativeness of the observed 
spectra can be compromised in several ways and a full list is presented in previous work Q. The net 
effect is that the uncertainties introduced into the fibre-spectra predominantly affect the calibration 
of the continuum slope and have relatively little impact on the emission/absorption line strengths. 
For this reason any given galaxy spectrum which is projected into the plane defined by (pc1, pc2) will 
not be uniquely defined in the direction of varying continuum but will be robust in the orthogonal 
direction (which measures the average line strength). 

This robust axis is simply that shown in Fig. 1(b) and we denote the projection onto this eigenspectrum 
by η, 

η = a · pc1 + pc2    (1) 

where a is a constant which we find empirically to be a = 0.5 ± 0.1. 
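
As a small worked illustration of Eq. (1), the sketch below combines two projections into the spectral type parameter and shows how the quoted uncertainty on a propagates. The function name and the input numbers are hypothetical, and the sign convention simply follows Eq. (1) as printed here.

```python
import numpy as np

def eta_from_projections(pc1, pc2, a=0.5):
    """Combine the first two projections as in Eq. (1): eta = a * pc1 + pc2."""
    return a * np.asarray(pc1, dtype=float) + np.asarray(pc2, dtype=float)

# Hypothetical projections for three galaxies (illustrative numbers only).
pc1 = np.array([-2.0, 0.0, 4.0])
pc2 = np.array([-1.0, 0.5, 3.0])

eta = eta_from_projections(pc1, pc2)
# Range of eta allowed by the quoted uncertainty a = 0.5 +/- 0.1.
eta_low = eta_from_projections(pc1, pc2, a=0.4)
eta_high = eta_from_projections(pc1, pc2, a=0.6)
print(eta, eta_low, eta_high)
```

Because the combination is linear, a galaxy with pc1 = 0 is unaffected by the uncertainty on a, while the spread grows in proportion to |pc1|.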

The result of this analysis is that we have now identified the single most (statistically) important 
component of the galaxy spectra which is robust to the known instrumental uncertainties, η. We have 
therefore chosen to adopt this (continuous) variable as our measure of spectral type. We show the 
observed distribution of η projections for the 2dF galaxies in Fig. 2 together with the projections 
calculated from the Kennicutt Atlas Q galaxies which have known morphologies. It can be seen from 
this figure that a clear trend exists between η and galaxy morphologies. 

Figure 1: The first two principal components identified from the 2dF galaxy spectra and the linear 
combinations which isolate either (a) the continuum slope or (b) the line features. 

Qualitatively, η is a measure of the relative absorption/emission line strength present in a given 
galaxy's spectrum, and hence can be interpreted as a measure of the relative current star-formation 
in that galaxy. To highlight this relationship we show in Fig. 2 the strong correlation between 
EW(Hα) and η. 

It is interesting to note that the distribution displays some degree of bimodality similar to that 
observed in the Sloan colour distribution [16]. 



3 Applications 

We show below three applications of η. They illustrate how η, which relates to atomic processes, is 
connected to the global statistics of galaxies and their clustering on the ~ 10 h^-1 Mpc scale. 

3.1 Population Mix in 2dF vs. 2MASS 

Having a continuous measure of the spectral type of a galaxy allows us to easily compare the population 
mix of different samples of galaxies. As an example of this we compare the η distribution of the 
2dFGRS with that of the overlapping 2MASS galaxy sample |l| (J_Kron < 14.45) in Fig. 3. It is clear 
from this figure that the two populations have quite different compositions. The Near-IR selected 
2MASS galaxies are dominated by early-type galaxies (the peak roughly corresponds to galaxies with 
E/S0 morphologies) which make up approximately two thirds of the sample. This contrasts with the 
b_J selected 2dFGRS galaxies which contain a much larger fraction of star-forming galaxies. 

Such comparisons between samples of galaxies are crucial in order to identify the different selection 
biases in galaxy surveys, and hence ensure that comparisons of subsequent results (e.g. luminosity 
functions) are fair. 

3.2 Luminosity Functions per Spectral Type 

Because η is a continuous measure of the spectral type we are now in a position to develop bivariate 
formalisms for such applications as the luminosity function φ(M, η). However, in order to simplify our 
analysis we have instead chosen to divide the distribution in η into four separate types, as shown in 
Fig. |. 

Figure 2: The distribution of η projections in the observed 2dF galaxies (left). Also shown in the 
same plot are the η projections calculated for a subset of the Kennicutt Atlas galaxies. The second 
plot shows the strong correlation between η and EW(Hα) (right). 

Figure 3: The distribution of the spectral type parameter, η, in the full 2dFGRS and the matched 
2MASS catalogue Q. We see that about two thirds of the 2MASS sample are early-type galaxies. 

Figure 4: The LF derived using the entire galaxy sample is shown in the top left plot. The top right 
plot shows how this relates to the summation of the individual LFs for each spectral type, which are 
shown in the remaining panels. Types 1 to 4 correspond to a sequence in η from passively to actively 
star-forming galaxies. 
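
Dividing a continuous parameter into four types amounts to a simple binning, which can be sketched as below. The boundary values in `TYPE_EDGES` are placeholders chosen for illustration; the actual cuts are not quoted in the text, and the function name is likewise an assumption.

```python
import numpy as np

# Hypothetical boundaries between types 1|2, 2|3 and 3|4 (placeholders,
# not the cuts actually adopted for the survey).
TYPE_EDGES = np.array([-1.0, 1.0, 3.0])

def spectral_type(eta, edges=TYPE_EDGES):
    """Return an integer type 1..4, increasing from passive to star-forming."""
    return np.digitize(eta, edges) + 1

eta = np.array([-2.5, -1.0, 0.3, 2.9, 3.0, 7.2])
types = spectral_type(eta)
counts = np.bincount(types, minlength=5)[1:]  # number of galaxies per type 1..4
fractions = counts / counts.sum()             # population mix of the sample
print(types, counts, fractions)
```

The same `fractions` array, computed for two different catalogues, gives a compact way of comparing their population mix in the sense of Section 3.1.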

Having defined our 4 spectral types we can now proceed to calculate the luminosity function (LF) 
for each of these types using the standard formalisms of step-wise maximum likelihood Q and the 
maximum likelihood fit of a Schechter function (more detailed descriptions of this analysis have 
been given in previous work [^). These LFs are plotted in Fig. 4. 

A general trend is clear in that galaxies with relatively more current star-formation systematically 
have a fainter characteristic magnitude and a steeper faint-end slope. It is also interesting to note that 
whilst the Schechter function provides a good fit to the LFs of the late-type galaxies it does not fit 
the faint early-type galaxies well, perhaps suggesting the presence of a significant dwarf population. 
This feature is not evident in the overall LF and so would not have been found without having first 
performed this classification. 
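
To make the roles of the characteristic magnitude and faint-end slope concrete, the sketch below evaluates the standard Schechter function per unit absolute magnitude for two parameter sets. The parameter values are hypothetical and chosen only to mimic the qualitative trend described above; they are not fitted values from the survey.

```python
import numpy as np

def schechter_mag(M, phi_star, M_star, alpha):
    """Schechter luminosity function per unit absolute magnitude."""
    x = 10.0 ** (-0.4 * (np.asarray(M, dtype=float) - M_star))  # L / L*
    return 0.4 * np.log(10.0) * phi_star * x ** (alpha + 1.0) * np.exp(-x)

M = np.linspace(-22.0, -16.0, 7)  # bright to faint

# Hypothetical parameters (assumptions for illustration only).
passive = schechter_mag(M, phi_star=1.0e-2, M_star=-19.6, alpha=-0.5)
active = schechter_mag(M, phi_star=1.0e-2, M_star=-19.0, alpha=-1.5)

for mag, p, a in zip(M, passive, active):
    print(f"M = {mag:6.1f}   passive = {p:.3e}   active = {a:.3e}")
```

With these choices the passive type dominates at the bright end, while the steeper slope of the active type makes it dominate at faint magnitudes.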

3.3 Clustering per Spectral Type 

In order to investigate the different clustering properties of different types of galaxies we make use 
of the two point correlation function, ξ. The calculated correlation function is shown in terms of 
the line-of-sight (π) and perpendicular to the line-of-sight (σ) separations, ξ(σ, π), in Fig. 5. In this plot we 
show the correlation function calculated from the most passively (type 1) and actively (types 3 and 
4) star-forming galaxies. 
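
For readers unfamiliar with this decomposition, the sketch below splits the separation of a galaxy pair into its components along and perpendicular to the line of sight. Taking the line of sight through the pair midpoint is a common convention and is assumed here; the text does not state which definition or which estimator was used, and the function name is hypothetical.

```python
import numpy as np

def sigma_pi(r1, r2):
    """Return (sigma, pi): separations perpendicular to and along the line of sight."""
    r1 = np.asarray(r1, dtype=float)
    r2 = np.asarray(r2, dtype=float)
    s = r1 - r2            # separation vector in redshift space
    los = 0.5 * (r1 + r2)  # line of sight to the pair midpoint (assumed convention)
    pi_par = abs(np.dot(s, los)) / np.linalg.norm(los)
    sigma = np.sqrt(max(np.dot(s, s) - pi_par**2, 0.0))
    return sigma, pi_par

# Observer at the origin; positions in arbitrary distance units.
radial_pair = sigma_pi([0.0, 0.0, 98.0], [0.0, 0.0, 102.0])
transverse_pair = sigma_pi([-1.5, 0.0, 100.0], [1.5, 0.0, 100.0])
print(radial_pair, transverse_pair)
```

Pair counts binned in (σ, π) are then compared with those of an unclustered random catalogue to estimate ξ(σ, π).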

It can be seen from Fig. 5 that the clustering properties of the two samples are quite distinct. 
The more passively star-forming ('red') galaxies display a prominent 'finger of God' effect and also 
have a higher overall normalisation than the more actively star-forming ('blue') galaxies. This is a 
manifestation of the well-known morphology-density relation. This diagram allows us to determine 
the combination of the mass density and biasing parameter β = Ω_m^0.6/b (fl^) per spectral type (pi]], 
i) 

Figure 5: The two point correlation function ξ(σ, π) plotted for passively (left) and actively (right) 
star-forming galaxies. The contour levels are 15, 10, 5, 3, 1 (labelled), 0.5, 0. 

4 Conclusions 

The new generation of extremely large astronomical surveys currently taking place is ushering in a 
new era of precision astronomy. However, in order to reduce the vast quantities of information being 
produced into more practical data sets one must first address the challenges of compressing the data 
in a meaningful way. 

As an example of this we have demonstrated in these proceedings how the large number 
of galaxy spectra being observed in the 2dFGRS can be reduced to a simple spectral 
classification, η. We have addressed the issues of ensuring that this classification is meaningful and 
also robust to the known instrumental uncertainties. In addition we have shown how this classification 
can be applied to the calculation of luminosity and correlation functions in a way which adds new 
constraints on scenarios for galaxy formation and biasing. 